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We consider multivariate two-sample tests of means, where the 
location shift between the two populations is expected to be related 
to a known graph structure. An important application of such tests 
is the detection of differentially expressed genes between two patient 
populations, as shifts in expression levels are expected to be coherent 
with the structure of graphs reflecting gene properties such as biologi- 
cal process, molecular function, regulation or metabolism. For a fixed 
graph of interest, we demonstrate that accounting for graph struc- 
ture can yield more powerful tests under the assumption of smooth 
distribution shift on the graph. We also investigate the identification 
of nonhomogeneous subgraphs of a given large graph, which poses 
both computational and multiple hypothesis testing problems. The 
relevance and benefits of the proposed approach are illustrated on 
synthetic data and on breast and bladder cancer gene expression 
data analyzed in the context of KEGG and NCI pathways. 

1. Introduction. The detection of differentially expressed (DE) genes, 
that is, genes whose expression levels change between two (or more) ex- 
perimental conditions, remains a major challenge in biology and medicine, 
especially in the context of cancer studies. For example, the identification 
of DE genes between breast cancer patients that are sensitive or resistant 
to tamoxifen can help understand resistance mechanisms to this drug and 
eventually improve breast tumor treatment [Loi et al. (2008)]. Similarly, 
finding DE genes between low-grade, noninvasive or more aggressive blad- 
der tumors may help understand the disease better and ultimately improve 
its diagnosis and treatment [Stransky et al. (2006)]. The application of the 
methods developed in this paper will be illustrated on the data sets from 
the above two papers. 
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However, the detection of a change in gene expression levels among a large 
gene list is a difficult problem from a statistical perspective, and lists of 
differentially expressed genes are generally hard to interpret, as they fo- 
cus on the level of genes instead of the level of molecular functions. In 
such a context, expression data from high-throughput microarray and se- 
quencing assays gain much in relevance from their association with graph- 
structured prior information on the genes, for example, Gene Ontology 
(GO; http://www.geneontology.org), Kyoto Encyclopedia of Genes and 
Genomes (KEGG; http://www.genome.jp/kegg) or NCI Pathway Integra- 
tion Database (NCI graphs; http://pid.nci.nih.gov). Most approaches 
to the joint analysis of gene expression data and gene graph data involve two 
distinct steps. First, tests of differential expression are performed separately 
for each gene. Then, these univariate (gene-level) testing results are extended 
to the level of gene sets, for example, by assessing the over-representation 
of DE genes in each set based on p-values for Fisher's exact test 1 (or a x 2 
approximation thereof) adjusted for multiple testing [Beissbarth and Speed 
(2004)] or based on permutation adjusted p- values for weighted Kolmogorov- 
Smirnov-like statistics [Subramanian et al. (2005)]. Another family of meth- 
ods directly performs multivariate tests of differential expression for groups 
of genes, for example, Hotelling's T 2 -test [Lu et al. (2005)]. It is known 
[Goeman and Buhlmann (2007)] that the former family of approaches can 
lead to incorrect interpretations, as the sampling units for the tests in the 
second step become the genes (as opposed to the patients) and these are ex- 
pected to have strongly correlated expression measures. This fact suggests 
that direct multivariate testing of gene set differential expression is more ap- 
propriate than posterior aggregation of individual gene-level tests. On the 
other hand, while Hotelling's T 2 -statistic is known to perform well in small 
dimensions, it loses power very quickly with increasing dimension [Bai and 
Saranadasa (1996)], essentially because it is based on the inverse of the em- 
pirical covariance matrix which becomes ill-conditioned. Additionally, such 
direct multivariate tests on unstructured gene sets do not take advantage of 
information on gene regulation or other relevant biological properties. An 
increasing number of regulation networks are becoming available, specifying, 
for example, which genes activate or inhibit the expression of which other 
genes. If it is known that a particular gene in a tested gene set activates 
the expression of another, then one expects the two genes to have coherent 
(differential) expression patterns, for example, higher expression of the first 
gene in resistant patients should be accompanied by higher expression of the 
second gene in these patients. Accordingly, the first main contribution of this 
paper is to propose and validate multivariate test statistics for identifying 
differential expression patterns (or, more generally, shifts in distribution) 
that are coherent with a given graph structure. 



1 Sometimes referred to as a hypergeometric test in the bioinformatics literature. 
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Next, given a large graph and observations from two data generating 
distributions on the graph, a more general problem is the identification of 
smaller nonhomogeneous subgraphs, that is, subgraphs on which the two 
distributions (restricted to these subgraphs) are significantly different. This 
is very relevant in the context of tests for gene set differential expression: 
given a large set of genes, together with their known regulation network, or 
the concatenation of several such overlapping sets, it is important to discover 
novel gene sets whose expression changes significantly between two condi- 
tions. Currently-available gene sets have often been defined in terms of other 
phenomena than that under study and physicians may be interested in dis- 
covering sets of genes affecting in a concerted manner a specific phenotype. 
Our second main contribution is therefore to develop algorithms that allow 
the exhaustive testing of all the subgraphs of a large graph, while avoiding 
one-by-one enumeration and testing of these subgraphs and accounting for 
the multiplicity issue arising from the vast number of subgraphs. 

As the problem of identifying variables or groups of variables which dif- 
fer in distribution between two populations is closely related to supervised 
learning, our proposed approach is similar to several learning methods. Ra- 
paport et al. (2007) use filtering in the Fourier space of a graph to train 
linear classifiers of gene expression profiles whose weights are smooth on 
a gene network. However, their classifier enforces global smoothness on the 
large regularization network of all the genes, whereas we are concerned with 
the selection of gene sets with locally-smooth expression shift between pop- 
ulations. In Jacob, Obozinski and Vert (2009) and Obozinski, Jacob and 
Vert (2011), sparse learning methods are used to build a classifier based on 
a small number of gene sets. While this approach leads in practice to the 
selection of groups of variables whose distributions differ between the two 
classes, the objective is to achieve the best classification performance with 
the smallest possible number of groups. As a result, correlated groups of 
variables are typically not selected. Other related work includes Fan and 
Lin (1998), who proposed an adaptive Neyman test in the Fourier space for 
time series. However, as illustrated below in Section 5, direct translation of 
the adaptive Neyman statistic to the graph case is problematic, as assump- 
tions on Fourier coefficients which are true for time series do not hold for 
graphs. In addition, the Neyman statistic converges very slowly toward its 
asymptotic distribution and the required calibration by bootstrapping ren- 
ders its application to our subgraph discovery context difficult. By contrast, 
other methods do not account for shift smoothness and try to address the 
loss of power caused by the poor conditioning of the T 2 -statistic by applying 
it after dimensionality reduction [Ma and Kosorok (2009)] or by omitting 
the inverse covariance matrix and adjusting instead by its trace [Bai and 
Saranadasa (1996), Chen and Qin (2010)] or using a diagonal estimator of 
the covariance matrix [Srivastava and Du (2008), Srivastava (2009)]. Lopes, 
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Jacob and Wainwright (2011) recently proposed a testing procedure based 
on random projection of the data in a lower dimension space, and showed 
that it was asymptotically more powerful than Bai and Saranadasa (1996), 
Chen and Qin (2010) and Srivastava and Du (2008) in the presence of corre- 
lation and when the spectrum of the covariance matrix decays fast enough. 
Vaske et al. (2010) recently proposed DE tests, where a probabilistic graph- 
ical model is built from a gene network. However, this model is used for 
gene-level DE tests, which then have to be combined to test at the level of 
gene sets. Several approaches for subgraph discovery, like that of Ideker et al. 
(2002), are based on a heuristic to identify the most differentially expressed 
subgraphs and do not amount to testing exactly all possible subgraphs. 
Concerning the discovery of distribution-shifted subgraphs, Vandin, Upfal 
and Raphael (2010) propose a graph Laplacian-based testing procedure to 
identify groups of interacting proteins whose genes contain a large number 
of mutations. Their approach does not enforce any smoothness on the de- 
tected patterns (smoothness is not necessarily expected in this context) and 
the graph Laplacian is only used to ensure that very connected genes do not 
lead to spurious detection. The Gene Expression Network Analysis (GXNA) 
method of Nacu et al. (2007) detects differentially expressed subgraphs based 
on a greedy search algorithm and gene set DE scoring functions that do not 
account for the graph structure. 

The rest of this paper is organized as follows. Section 2 explains how 
to build a lower-dimension basis in which to apply the multivariate test of 
means. Section 3 presents our graph-structured two-sample test statistic and 
states results on power gain for smooth-shift alternatives. Section 4 describes 
procedures for systematically testing (without fully enumerating) all possible 
subgraphs of a large graph. Section 5 presents results for synthetic data and 
Section 6 on breast and bladder cancer gene expression data sets analyzed 
in the light of pathways from the KEGG and NCI databases. Section 7 
presents softwares implementing the proposed methods. Finally, Section 8 
summarizes our findings and outlines ongoing work. 

Although this work is motivated by the specific question of differential 
expression testing of gene networks, our proposed structured two-sample test 
of means on a graph and our nonhomogeneous subgraph discovery algorithm 
can actually be used in any situation where one searches for differences 
between two populations that are expected to be coherent with a known 
graph structure. Therefore, our methodological contributions in Sections 3 
and 4 are presented in the general context of two-sample tests on graphs. 

2. Graph-based dimensionality reduction. As stated in the Introduction, 
each of the two main paradigms for testing differential expression of a gene 
set have their limitations. Two-step methods generally do not directly test 
the existence of a mean shift between two multivariate distributions [Goe- 
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Fig. 1. Synthetic example of the joint distribution of the expression measures of two 
genes in two patient populations. The color and shape of the plotting symbols indicate the 
patient group and the x- and y-axes correspond to the expression measures of the first and 
second gene, respectively. 



man and Biihlmann (2007)]. The second step, which often treats the genes 
as the sampling units, renders the interpretation of p- values problematic 
and may lead to a large loss of power or Type I error control when sets of 
genes have correlated expression. Multivariate statistics, on the other hand, 
allow a direct formulation of and solution to the testing question: the sam- 
pling units are vectors of gene expression measures (e.g., corresponding to 
patients) and the question is whether two such sets of random vectors are 
likely to have arisen from distributions with equal means. Figure 1 illus- 
trates another classical advantage of multivariate approaches: genes taken 
individually may have extremely small mean shifts between two populations, 
although their joint distributions clearly differ between the two populations. 
Here, again, this phenomenon typically happens for sets of genes whose 
expression measures are correlated, which is not unlikely for pathways or 
annotated gene sets. 

Unfortunately, with moderate sample sizes, multivariate statistics lose 
power quickly in a high dimension. If some type of side information is avail- 
able regarding particular properties of the expression shift, a possible ap- 
proach to get the best of both worlds would be to: (1) project the vectors 
of covariates in a new space of lower dimension that preserves the distribu- 
tion shift, that is, the distance between the expression measures of the two 
groups, and (2) apply the multivariate statistic in this new space. One could 
thus perform the appropriate multivariate test, while avoiding the loss of 
power caused by the high-dimensionality of the original covariate space. 

A possible source of information about the expression shift is the growing 
number of available gene networks. Indeed, while the difference in mean ex- 
pression between two groups of patients may not be entirely coherent with 
an existing network (e.g., because of noise in the data, errors in the annota- 
tion, or inappropriateness of the chosen network for the biological question 
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of interest), it is reasonable to expect that this shift will not be entirely 
contradictory with the given graph structure. For example, repressed genes 
should be more connected to other repressed genes than to overexpressed 
genes. Given this assumption, we intend to build a space of lower dimension 
than the original gene space, but which preserves most of the distribution 
shift between the two populations. 

More precisely, consider a network of p genes, represented by a graph 
Q = (V,£), with |V| = p nodes and edge set £. Let 5gR p denote the mean 
shift, that is, the vector of differences in mean expression measures for these p 
genes between the two populations of interest. Suppose we expect the shift 5 
to be coherent with the graph Q, in the sense that it has low energy Eg (5) for 
a particular energy function Eg defined on Q. Then, we wish to build a space 
of lower dimension k <C p capturing most of the low energy functions. To this 
end, we start by finding the function that has the lowest possible energy, 
then the function that has lowest possible energy in the orthogonal space of 
the first one, up to the kth function with lowest energy in the orthogonal 
subspace of the first k — 1 functions. That is, for each i < k, we define 

f argmm Eg (f) 
(2.1) Ui = < 

[ such that U{ -Luj,j <i. 

If Eg is a positive semi-definite quadratic form Eg (5) = S T Qg5, for some 
positive semi-definite matrix Qg = UAU T , where U is an orthogonal matrix 
and A a diagonal matrix with elements Xi,i = 1, . . . ,p, then the solution to 
equation (2.1) is given by the k eigenvectors of Qg corresponding to the 
smallest k eigenvalues. It is easy to check that these eigenvalues are the 
energies of the corresponding functions Ui, that is, Eg(ui) = Aj. 

Different choices of Qg lead to different notions of coherence of the ex- 
pression shift with the network. A classical choice is the graph Laplacian C 
Suppose Q is an undirected graph with adjacency matrix A, with a™ = 1 if 
and only if S E and a™ = otherwise, and degree matrix D = Diag(-Al), 
where 1 is a unit column- vector, Diag(x) is the diagonal matrix with diag- 
onal x for any vector x, and Da = di. The Laplacian matrix of Q is then 
typically defined as C = D — A or £ norm = I — D~ x l 2 AD~ X I 2 for the normal- 
ized version, leading to energies 52 ijjeV (5i - 5j) 2 and ^ iJeV (-^= - ~^j=) 2 ■, 

respectively. Note that, in this case, the Laplacian matrix C, energy E and 
basis functions Ui extend the classical Fourier analysis of functions on Eu- 
clidean spaces to functions on graphs, by transferring the notions of Laplace 
operator, Dirichlet energy and Fourier basis, respectively [Evans (1998)]. 

More generally, any positive semi-definite matrix can be chosen. In the 
case of gene regulation networks, we do not necessarily expect as strong 
a coherence as that corresponding to the Dirichlet energy defined by the 
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graph Laplacian, since some of the annotated interactions may not be rel- 
evant in the studied context and some antagonist interactions may cancel 
each other. For example, if a gene is activated by two others, one who is 
underexpressed and the other over expressed, we may observe no change in 
the expression of the gene, but a nonzero Dirichlet energy ^ jevi^ ~~ fy) 2 - 
Additionally, for applications like structured gene set differential expression 
detection, one may use negative weights for edges that reflect a negative 
correlation between two variables, for example, a gene i whose expression 
inhibits the expression of another gene j. In this small variation of 

the shift on the edge between % and j should correspond to a small \Si + <5j-|. 
This can be achieved in the same formalism by simply considering a signed 
version of the adjacency matrix A, that is, tt,j = 1 if gene i activates gene j 
and — 1 if it inhibits gene j . A signed version of the graph Laplacian is then 
£ s i g n = D — A, where D = Diag( |^4|1) is the degree matrix and \A\ denotes 
the entry-wise absolute value of A. Note that such a signed Laplacian was 
used as a penalty for semi-supervised learning in Goldberg (2007). 

In the context of this work, we, moreover, consider directed graphs Q = 
(V,£), where the edge set E consists of ordered pairs of nodes. The adjacency 
matrix A may be asymmetric, with entries aij ^ if and only if £ £ , 
that is, there is an (directed) edge pointing from node vi to node Vj. We 
then use the following energy function: 



where d~ = Yl P j=i\ a ji\ ls the indegree of node Vi, that is, the number of 
directed edges pointing from any node to Vi. According to this definition, an 
expression shift will have low energy if the difference in mean expression of 
any given gene between the two populations is similar to the (signed) average 
of the differences in mean expression for the genes that either activate or 
inhibit it. 

It is immediate to check that Eg(5)=5 T MgS, with Mg = (I-Dl 1 A T ) T (I- 
DZ 1 A T ), where D_ = Diag((d~)i = i i ... jP ) is the matrix of indegrees, / = 
Diag((I(d7 7^ fy)i=l,...,p) is a modification of the identity matrix where di- 
agonal elements corresponding to nodes with zero indegree are set to zero, 
and the value of the indicator function I is 1 if its argument is true and 
zero otherwise. Note that a very similar function was used in the context of 
regularized supervised learning by Sandler et al. (2009). 

Following our principle to build a lower dimension space, we use the first 
few eigenvectors of Mg to obtain orthonormal functions with low energy. As 
an example, Figure 2 displays the eigenvectors of Mg for a simple four-node 
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Fig. 2. Eigenvectors of the signed Laplacian £. s ign for the simple undirected four-node 
graph of example (2.3). The eigenvectors of Mg for this particular network are the same. 
The corresponding eigenvalues are 0, l,l,4j for Mg and 0,1,1,4 for £ B i g n- Nodes are 
colored according to the value of the eigenvector, where green corresponds to high positive 
values, red to high negative values, and black to 0. "T" -shaped edges have negative weights. 
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where A takes on negative values for negative interactions, such as expres- 
sion inhibition. The first eigenvector, corresponding to the smallest energy 
(eigenvalue of zero), can be viewed as a "constant" function on the graph, 
in the sense that its absolute value is identical for all nodes, but nodes con- 
nected by an edge with negative weight take on values of opposite sign. By 
contrast, the last eigenvector, corresponding to the highest energy, is such 
that nodes connected by positive edges take on values of opposite sign and 
nodes connected by negative edges take on values of the same sign. Note 
that, for this particular example, the adjacency matrix is symmetric, which 
need not always be the case. Here, the signed Laplacian turns out to have 
the same eigenvectors as Mg, which is not the case generally. 

Consider now a slightly different graph, with directed edges, only positive 
interactions to avoid confusion, and adjacency matrix 
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For this graph, Figure 3 shows that the two notions of energy lead to two 
different bases. While the signed Laplacian matrix (by definition based on 
a symmetrized version of A for an undirected graph) has only one (constant) 
eigenvector of null energy, two of energy 1, and one of 4, Mg has three 
orthogonal vectors of null energy. Note, however, that the first and last 
eigenvectors are still the same across the two bases. 
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Fig. 3. Eigenvectors of the signed Laplacian £ s i gn (top) and of Mg (bottom) for the sim- 
ple directed four-node graph of example (2.4)- The corresponding eigenvalues are 0,1,1,4 
and 0, 0, 0, | , respectively. Nodes are colored according to the value of the eigenvector, 
where green corresponds to high positive values, red to high negative values, and black to 0. 

More generally, this illustration suggests that projecting on the first eigen- 
vectors of Mg will not preserve the same shifts as projecting on the first 
eigenvectors of £ s ign- It is possible for a shift vector to have low energy (2.2) 
but larger signed Dirichlet energy ^ j e v(^« ~~ a-ijdj) 2 , where we recall that a^- 
is 1 for an edge indicating a positive interaction between i and j and —1 for 
an edge indicating a negative interaction. This is, for example, the case of 
the second eigenvector of Mg on the bottom row of Figure 3. It is therefore 
conceivable that such a shift essentially lies in the space spanned by the first 
few eigenvectors of Mg, but that its projection in the space formed by the 
first few eigenvectors of £ s igned is smaller. As a consequence, for a particular 
shift using one basis or the other for dimensionality reduction will lead to 
more or less gain in power, which means that the choice of basis should be 
adapted to the expected type of smoothness of the shift. 

While we introduce the idea in the context of gene regulation networks 
and testing for differential expression, the same dimensionality reduction 
principle applies to any multivariate testing problem for which the variables 
have a known structure, as represented by a graph. 

As a last remark, we emphasize that our requirement that the shift be 
coherent with the network is not too strict in practice. It may sound like 
most pairs of nodes must have shifts whose directions are consistent with 
the nature of the edge connecting the nodes, but: 

• In practice, keeping a few eigenvectors already allows to represent several 
types of shifts which are not perfectly coherent with the network, as illus- 
trated on Figure 2. The projection only shrinks those shifts which severely 
contradict the prior given by the network. 
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• In Section 5.2 we illustrate the fact that this type of projection still leads 
to gain in power even in case of strong misspecifications in the network, 
that is, when a lot of edges are missing or wrong. 

• Lopes, Jacob and Wainwright (2011) show that in a high dimension, ran- 
dom projection of the data in a lower dimension space yields gains in power 
against the regular Hotelling T 2 in the presence of correlation and if the 
spectrum of the covariance matrix decays fast enough. This result sug- 
gests that there is hope to gain power even in the case where the network 
doesn't bring much information about the shift. 

In the remainder of this paper we denote by / = U T f the coefficients of 
a vector / 6 K' v ' after projection on a basis U (typically the eigenvectors of 
a Qg matrix). 

3. Graph-structured two-sample test of means under smooth-shift alter- 
natives. For multivariate normal distributions, Hotelling's T 2 -test, a clas- 
sical test of location shift, is known to be a uniformly most powerful in- 
variant against global-shift alternatives. The test statistic is based on the 
squared Mahalanobis norm of the sample mean shift and is given by T 2 = 
ni+n 2 ^ 1 ~~ x 2 ) T T l ~ 1 (xi — x 2 ), where rij, x~i and £ denote, respectively, the 
sample sizes, means and pooled covariance matrix, for random samples 
drawn from two p-dimensional Gaussian distributions, J\f(fAi, £), i = 1,2. 
Under the null hypothesis Ho : ^1 = ^2 of equal means, NT 2 follows a (cen- 
tral) F-distribution Fq(p, n\ + n 2 — p— 1), where N = ?n^+n2-i)p ' ^ n S enera ^ 
NT 2 follows a noncentral F-distribution F( ™ 1 ™£ A 2 (8, ni + n 2 —p—1), 
where the noncentrality parameter is a function of the Mahalanobis norm of 
the mean shift 5, A 2 (5, £) = 5 T T>~ 1 5, which we refer to as the distribution 
shift. In the remainder of this paper, unless otherwise specified, T 2 -statistics 
are assumed to follow the nominal F-distribution, for example, for critical 
value and power calculations. 

For any orthonormal basis U and, in particular, for our graph-based basis, 

direct calculation shows that T 2 =f 2 = ^^(xi-x 2 ) T U{U T tuy 1 U J \x x - 

x 2 ), that is, the statistic T 2 in the original space and the statistic T 2 in the 
new graph-based space are identical. More generally, for k <p, the statistic 
in the original space after filtering out dimensions above k is the same as the 
statistic T 2 restricted to the first k coefficients in the new space defined by U : 

Tk = n Hl ™l (5i " x 2 Y U^U^tU^Y^px - x 2 ) 
nin 2 _ s T 



(x! - x 2 ) T Ul k U T (Ul k U T tUl k U T yUl k U T ( Xl - x 2 ) 



m +n 2 

where A + denotes the generalized inverse of a matrix A, the p x k matrix t/rw 
denotes the restriction of U to its first k columns, and l k is a p x p diagonal 
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matrix, with ith diagonal element equal to one if i < k and zero otherwise. 
Note that, as retaining the first k dimensions corresponds to a noninvert- 
ible transformation, this filtering indeed has an effect on the test statistic, 
that is, we have T% 7^ T 2 in general. As the Mahalanobis norm is invari- 
ant to invertible linear transformations, using an invertible filtering (such as 
weighting each component according to its corresponding eigenvalue) would 
have no impact on the test statistic. 

Hotelling's T 2 -test is known to behave poorly in the high dimension; 
Lemma 1 stated and proved in the supplemental article Supplement A [Ja- 
cob, Neuvial and Dudoit (2011a)] shows that gains in power can be achieved 
by filtering. Specifically, let 5 = U T 5 and E = U T T,U denote, respectively, 
the mean shift and covariance matrix in the new space. Given k < p, let 
A|(5, E) = (S[ fc ]) _1 5^] denote the distribution shift restricted to the first k 
dimensions of 5 and E, that is, based on only the first k elements of 5, 
(Si : i < k) , and the first k x k diagonal block of S, ((Tjj : i,j <k). Under 
the assumption that the distribution shift is smooth, that is, lies mostly in 
the first few graph-based coefficients, so that A|(5, E) is nearly maximal for 
a small value of k, Lemma 1 states that performing Hotelling's test in the 
new space restricted to its first k components yields more power than testing 
in the entire new space. Equivalently, the test is more powerful in the origi- 
nal space after filtering than in the original unfiltered space. The increase in 
shift rj(a, k, I) required to maintain power when increasing dimension can be 
evaluated numerically for any (a,k,l). Note that this result holds because 
retaining the first k new components is a noninvertible transformation. 

Corollary 1 in the supplemental article Supplement A [Jacob, Neuvial and 
Dudoit (2011a)] states that if the distribution shift lies in the first k new 
coefficients, then testing in this subspace yields strictly more power than 
using additional coefficients. In particular, if there exists k < p such that 
Sj = Vj > k (i.e., the mean shift is smooth) and E is block-diagonal such 
that &ij = Vi < k,j > k, then gains in power are obtained by testing in the 
first k new components. Although nonnecessary, this condition is plausible 
when the mean shift lies at the beginning of the spectrum (i.e., has low 
energy), as the coefficients which do not contain the shift are not expected 
to be correlated with the ones that do contain it. 

Note that the result in Lemma 1 is more general, as testing in the first k 
new components can increase power even when the distribution shift par- 
tially lies in the remaining components, as long as the latter portion is below 
a certain threshold. Figure 4 illustrates, under different settings, the increase 
in distribution shift necessary to maintain a given power level against the 
number of added coefficients. 

Under the assumption of block-diagonal covariance, Corollary 2 (in the 
supplemental article Supplement A [Jacob, Neuvial and Dudoit (2011a)]) 
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Fig. 4. Left: Increase in distribution shift required for Hotelhng's T 2 -test to maintain 
a given power when increasing the number of tested new coefficients: A^ +t — A| vs. I 
such that /3 ay k+i(Ak+i) = Ax,*(Afc). Power /? a ,fc+;(A| +i ) computed under the noncentral 
F -distribution F(^~^Af. +{ ; k + I, n\ + n% — (k + I) — 1), for m=ri2 = 100 observations, 
k = 5, and a = 10~ 2 . Each line corresponds to the fixed shift A| and power /3 a> fc(A|) pair 
indicated in the legend. Right: Zoom on the first 30 dimensions. 



directly relates the energy of the mean shift vector to the gain in power. It 
states that if the energy of the mean shift vector 5 is small enough, that 
is, if the mean shift is coherent enough with the network, then testing in 
the first k dimensions of the new basis is more powerful than testing in the 
original space. The corresponding upper bound on the mean shift energy can 
be quantified for a given generative setting (/xi,/X2,E), graph Q and level a. 
Tighter and looser bounds can be straightforwardly derived using the same 
principle for the diagonal and general covariance cases, respectively. 

4. Nonhomogeneous subgraph discovery. A systematic approach for dis- 
covering nonhomogeneous subgraphs, that is, subgraphs of a large graph that 
exhibit a significant shift in means, is to test all of them one-by-one. 

This poses a huge combinatorial problem even for moderately large graphs 
(p = 50, say), as the number of (connected) subgraphs of size A; of a graph 
of size p can be exponential in p and k. Exhaustive search is therefore not 
feasible in practice, especially for differential expression on gene networks, 
where p is typically in the dozens or hundreds of genes. Therefore, it is impor- 
tant to rapidly identify sets of subgraphs that all satisfy the null hypothesis 
of equal means. To this end, we prove an upper bound on the value of the test 
statistic for any subgraph containing a given set of nodes (Lemma 2 in the 
supplemental article Supplement A [Jacob, Neuvial and Dudoit (2011a)]). 
An exact algorithm is derived from this upper bound in Section 4.1, and 
a quicker, approximate algorithm is proposed in Section 4.2. 
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Algorithm 1: Nonhomogeneous subgraph discovery algorithm. The sub- 
graphBoundary of a subgraph g of Q is defined as the set of supergraphs 
of g obtained by adding any one node of Q which is connected to a node 

of 9- 

Input: G,Xi,X 2 ,a,q 
Output: selectedSubgraphs 

1 selectedSubgraphs = 0; 

2 previousSubgraphs = nodes (Q); 

3 prunedSubgraphs = 0; 

4 foreach s € {1 . . . q — 1} do 



5 


checkedSubgraph 


s = 0; 


6 


foreach previousSubgraph do 


7 




foreach subgraph E subgraphBoundary(previousSubgraph) do 


8 






if subgraph has been checked or has a pruned subgraph then next 


9 






if s < q - 


1 then 


10 








if upperBound(subgraph, C7, Xi,X2,q) <T^ k then 


11 










add subgraph to prunedSubgraphs; 


12 








else 




13 










add subgraph to currentSubgraphs; 


14 








end 




15 






else 






16 








foreach q-subgraph G subgraphBoundary(subgraph) do 


17 










if q-subgraph has been checked or has a pruned subgraph 
then 


18 












next 


19 










else 


20 












if f fc 2 (q-subgraph,A: 1 ,A: 2 ) > T£ fc then 


21 












add q-subgraph to selectedSubgraphs 


22 












end 


23 












add q-subgraph to checkedSubgraphs 


24 










end 


25 








end 




26 






end 






27 






add subgraph to checkedSubgraphs 


28 




end 








29 


end 










30 


set previousSubgraphs to currentSubgraphs 


31 end 













4.1. Exact algorithm. Given a large graph Q with p nodes, we adopt 
a branch-and-bound-like approach [Land and Doig (I960)] to test subgraphs 
of size q < p at level a, as described in Algorithm 1. We start by checking, 
for each node in Q, whether the Hotelling T 2 -statistic in the first k new 
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components of any subgraph of size q containing this node can be guaran- 
teed (by virtue of Lemma 2) to be below the level-a critical value T 2 fc , for 
example, (1 — a)-quantile of Fo(k, n\ + n<i — k — 1) distribution. If this is the 
case, the node is pruned, that is, removed from the graph. The algorithm 
iteratively enriches a list of pruned subgraphs and a list of candidate sub- 
graphs (called prunedSubgraphs and current Subgraphs in Algorithm 1, 
resp.) of increasing number of nodes s = 1, . . . ,q — 1. Pruned subgraphs are 
those for which one can guarantee that no supergraph of size q can reach 
significance level a, and candidate subgraphs are those for which this guar- 
antee cannot be given. The key of the algorithm is that at step s, only those 
graphs containing a candidate subgraph have to be considered. 

This guarantee is obtained by applying Lemma 2 in the supplemental 
article Supplement A [Jacob, Neuvial and Dudoit (2011a)], which gives an 
upper bound on the value of the test statistic for any subgraph containing 
a given set of nodes. For a subgraph g of Q of size q <p, Hotelling's T 2 - 
statistic in the first k < q new components of g is defined as 

where Usu is the q x k restriction of the matrix of q eigenvectors of Q g 
to its first k columns [i.e., U^(g), where we omit g to ease notation] and 

Xi(g),i = 1,2, and £(g) are, respectively, the empirical means and pooled 
covariance matrix restricted to the nodes in g. Lemma 2 states that for any 
number k of retained components, and for any subgraph g' of size q' of g, 
T£(g) is upper bounded by the T 2 statistic of the subgraph whose nodes are 
in v(g' , q — q'), that is, the union of the nodes of g' and the nodes of g whose 
shortest path to a node of g' is less than or equal to r. The set u(g',r) is 
called the r-neighborhood of g' . As & corollary of Lemma 2, the subgraphs 
returned by Algorithm 1 are exactly those who exhibit a significant shift in 
means at the prescribed level a. 

Note that the bound in Lemma 2 takes into account the fact that the T 2 - 
statistic is eventually computed in the first few components of a basis which 
is not known beforehand: at each step, for each potential subgraph g 1 which 
would include the subgraph g which we consider for pruning, theT 2 (#') that 
needs to be bounded above depends on Q g > . 

4.2. Mean-shift approximation. For "small-world" graphs above a cer- 
tain level of connectivity and q large enough, v(g' ,q — s), the (q— s)-neighbor- 
hood of g' , tends to be large, at least at the beginning of the above exact 
algorithm, and the number of tests actually performed may not decrease 
much compared to the total number of possible tests. One can, however, 
identify much more efficiently the subgraphs whose sample mean shift in 

the first k components of the new space has Euclidean norm ||5[fc](s0|| = 
||J7rL (xi(g) — X2(g))\\ above a certain threshold. Indeed, it is straightforward 
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to see that 

WU^ig) - x 2 (g))\\ 2 
<\\U T {x x {g)-x 2 {g))\\ 2 

= (ff) - x 2 {g)\\ 2 

<||xi(^)-x 2 (^)|| 2 

+ max \\x 1 (vi,...,v q _ s )-x 2 (v 1 ,...,v q - s )\\ 2 . 

vi,...,v q - a eu{g',q-s) 

Using this inequality yields an upper bound on T%(g) that can be used as 
upperBound at line 10 of Algorithm 1. This defines a procedure that iden- 
tifies all subgraphs for which the Euclidean norm of the sample mean shift 

exceeds a given threshold: ||<5^](g)|| 2 > 9. For any a, if this threshold 9 is 
low enough, all the subgraphs with T 2 (g) > T 2 fc are included in this set. 
Performing the actual T 2 -test on these preselected subgraphs then yields 
exactly the set of subgraphs that would have been identified using the ex- 
act procedure of Section 4.1. More precisely, Lemma 4, in the supplemental 
article Supplement A [Jacob, Neuvial and Dudoit (2011a)], states that for 
any subgraph which would be detected by Hotelling's T 2 -statistic T 2 {g) but 

not by the Euclidean criterion ||<5[fc]((?)|| 2 , the sample covariance matrix in 
the restricted new space (after filtering) has an eigenvalue below a certain 
threshold. This implies that such false negative subgraphs (from the Eu- 
clidean approximation to the exact algorithm) have a small mean shift in 
the new space, but in a direction of small variance. In the context of gene 
expression, this is related to the well-known issue of the detection of DE 
genes by virtue of their small variances. Even though the differences in ex- 
pression appear to be significant for these genes, they correspond to small 
effects that may not be interesting from a practical point of view (i.e., bio- 
logically nonsignificant). Methods for addressing this problem are proposed 

in Lonnstedt and Speed (2002). Note that A m i n (X(g)) < A m i n (Snu (<?)); thus, 
the remark on variances holds for both the new and the original spaces. 
However, if q is large, we expect A m i n (£(<?)) to be very small, while filtering 
somehow controls the conditioning of the covariance matrix. 

4.3. Multiple hypothesis testing. Testing for homogeneity over the poten- 
tially large number of subgraphs investigated as part of the above algorithms 
immediately raises the issue of multiple testing. However, because one does 
not know in advance the total number of tests and which tests will be per- 
formed specifically, standard multiple testing procedures, such as those in 
Dudoit and van der Laan (2008), are not immediately applicable. 

In an attempt to address the multiplicity issue, we apply a permuta- 
tion procedure to control the number of false positive subgraphs under the 
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complete null hypothesis of identical distributions in the two populations. 
Specifically, one permutes the class/population labels (1 or 2) of the n\ + ri2 
observations and applies the nonhomogeneous subgraph discovery algorithm 
to the permuted data to yield a certain number of false positive subgraphs. 
Repeating this procedure a sufficiently large number of times produces an es- 
timate of the distribution of the number of Type I errors under the complete 
null hypothesis of identical distributions. 

We evaluate the empirical behavior of the procedures proposed in Sec- 
tions 3 and 4, first on synthetic data, then on breast cancer microarray data 
analyzed in the context of KEGG pathways. 

5. Results on synthetic data. The performance of the graph-structured 
test is assessed in cases where the distribution shift A 2 satisfies the smooth- 
ness assumptions described in Section 3. We first generate a connected ran- 
dom graph Q with p = 20 nodes and 20 edges. Next, we generate 10,000 
data sets in the space corresponding to the basis U defined by the eigenvec- 
tors of the Qg matrix for the graph Q; an inverse transformation is applied 
to random vectors generated is this new space. Each data set comprises 
n\= n<i = 20 Gaussian random vectors in MP, with null mean shift 6 for 
5000 data sets and nonnull mean shift 5 for the remaining 5000. For the 
latter data sets, the nonzero shift is built in the first ko = 3 graph-based 
coefficients (the shift being zero for the remaining p — ko coefficients): <5j ^ 
if and only if * < k and A 2 (5, E) = A 2 (5, E) = 5 T t^5 = 1. We consider two 
covariance settings. In the first one, the covariance matrix in the new space 
is diagonal, with diagonal elements equal to In the second, correlation 
is introduced between the shifted coefficients only. Specifically for i, j < ko, 
^ij = 7§ if i ^ 3> s «i = otherwise. 

5.1. Fixed known network. Figure 5 displays receiver operator charac- 
teristic (ROC) curves for mean shift detection by the standard Hotelling 
T 2 -test, T 2 in the first ko graph-based coefficients, T 2 in the first ko prin- 
cipal components (PC), the adaptive Neyman test of Fan and Lin (1998), 
and a modified version of this fourth test where the correct value of fco is 
specified. Note that we do not consider sparse learning approaches [Jacob, 
Obozinski and Vert (2009), Jenatton, Audibert and Bach (2009), Obozinski, 
Jacob and Vert (2011)], but it would be straightforward to design a realistic 
setting where such approaches are outperformed by testing, for example, by 
adding correlation between some of the functions under Hi. 

The first important comparison is between the classical Hotelling T 2 - 
test vs. the T 2 -test in the new graph-based space (Figure 5, top row). As 
expected from Lemma 1, testing in the restricted space where the shift lies 
performs much better than testing in the full space which includes irrelevant 
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Fig. 5. Synthetic data: ROC curves for the detection of a smooth shift. Left: Diagonal 
covariance structure. Right: Block-diagonal covariance structure. Top: Comparison of tests 
based on the standard Hotelling T 2 -statistic in the original space, T 2 -statistic in the first 
ko graph-based coefficients, and T 2 -statistic in the first ko principal components. Mid- 
dle: Comparison with the statistics of Bai and Saranadasa (1996) (BS), Chen and Qm 
(2010) (CQ), and Srivastava and Du (2008) (SD). Bottom: Comparison with the Neyman 
statistics of Fan and Lin (1998). 
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Fig. 6. Synthetic data: Sensitivity to choice of k. Power of the T 2 -test in the first k 
graph-based coefficients for a graph of 20 nodes, when the actual distribution shift A 2 = 1 
is evenly distributed among the first ko = 5 graph-based coefficients and with n\ = ri2 = 20. 

coefficients. The difference can be made arbitrarily large by increasing the 
dimension p and keeping the shift unchanged. The graph-structured test re- 
tains a large advantage even for moderately smooth shifts, for example, when 
ko = 3 and p = 5. Of course, this corresponds to the optimistic case where 
the number of shifted coefficients ko is known. Figure 6 shows the power of 
the test in the new space for various choices of k. Even when missing some 
coefficients (k < k$) or adding a few irrelevant ones (k > ko), the power of 
the graph-structured test is higher than that of the T 2 -test in the full space. 
The principal component approach is shown in Figure 5 (top row) because it 
was proposed for the application which motivated our work [Ma and Kosorok 
(2009)] and because it also illustrates that the improvement in performance 
originates not only from dimensionality reduction, but also from the fact 
that this reduction is in a direction that does not decrease the shift. We 
emphasize that power entirely depends on the nature of the shift and that 
a PC-based test would outperform our graph-based test when the shift lies 
in the first principal components rather than graph-based coefficients. 

The panels in the middle row show that the statistics of Bai and Saranadasa 
(1996), Chen and Qin (2010) and Srivastava and Du (2008) are also largely 
outperformed by our graph-structured statistic. This observation suggests 
that when such a graph-based prior on the shift is available, working in the 
new, lower-dimensional space does better at solving the problem of high- 
dimensionality than methods based on diagonal approximations of the co- 
variance matrix. In addition, as one could expect, the procedures of Bai and 
Saranadasa (1996), Chen and Qin (2010) and Srivastava and Du (2008) per- 
form very poorly in the presence of correlation. Here again, for a nonsmooth 
shift, the comparison would be less favorable to our procedure. We also con- 
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sidered the recently-proposed random projection approach of Lopes, Jacob 
and Wainwright (2011). Random projection was shown to give more power 
than Bai and Saranadasa (1996), Chen and Qin (2010) and Srivastava and 
Du (2008) in high-dimensional cases. However, as expected in our simulation 
setting where the sample size is twice the number of dimensions, it did not 
improve upon the Hotelling T 2 -test (ROC curve not shown for the sake of 
readability). The method of Lopes, Jacob and Wainwright (2011) is more 
appropriate in a higher dimension and when no prior on the shift direction 
is available. 

Finally, we consider the adaptive Neyman test of Fan and Lin (1998) 
(bottom two panels of Figure 5), which takes advantage of smoothness as- 
sumptions for time series. This test differs from our graph-structured test, 
as Fourier coefficients for stationary time series are known to be asymptot- 
ically independent and Gaussian. For graphs, the asymptotics would be in 
the number of nodes, which is typically small, and necessary conditions such 
as stationarity are more difficult to define and unlikely to hold for data such 
as gene expression measures. In the uncorrelated setting, the modified ver- 
sion of the Fan and Lin (1998) statistic based on the true number of nonzero 
coefficients performs approximately as well as the graph-structured T 2 . How- 
ever, for correlated data, it loses power and both versions of the Neyman test 
can have arbitrarily degraded performance. This, together with the need to 
use the bootstrap to calibrate the test, illustrates that direct transposition 
of the Fan and Lin (1998) test to the graph context is not optimal. 

5.2. Fixed network with errors. We now consider the less idealistic case 
where the network used for testing is not exactly the one which was used to 
generate the data. More precisely, we follow the same procedure as in the 
correlated case of Section 5.1, but remove or add some edges to the network 
between the moment where we use it to generate the two samples and the 
moment where we use it in our testing procedure. This setting is much closer 
to what is likely to happen with real data, as available networks may miss 
several gene interactions which are not known yet and may include some 
incorrect interactions or some which are irrelevant for the problem under 
consideration. It is easy to see that in a worst case scenario, removing or 
adding an edge to the network can arbitrarily shrink the T 2 statistic. Take, 
for example, two disconnected nodes and assume without loss of generality 
that the empirical covariance matrix is the identity matrix and the empirical 
mean shifts for the two nodes are 1 and —1. Then, 5 T 5 is 2, but adding 
an edge between the two nodes and projecting on the first eigenvector of 
the graph Laplacian matrix shrinks the observed shift to 0. A probabilistic 
analysis over random perturbations would be out of the scope of this paper, 
but the following simulation study is intended to give insight into what 
would happen in practice if randomly chosen edges are either wrongly added 
or omitted. 




Fig. 7. Synthetic data: ROC curves for the detection of a smooth shift in the presence 
of errors in the network. Comparison of tests based on the standard Hotelling T 2 -statistic 
in the original space and T 2 -statistic in the first k graph-based coefficients. Top: After 
randomly removing 20 (left) and 40 (right) edges. Bottom: After randomly adding 20 (left) 
and 40 (right) edges. 



For the sake of clarity, Figure 7 only shows ROC curves for our graph- 
structured T 2 with k = 2,3,4, and the standard Hotelling T 2 -statistic. The 
other competitors [Bai and Saranadasa (1996), Ma and Kosorok (2009), 
Chen and Qin (2010), Fan and Lin (1998), Srivastava and Du (2008), Lopes, 
Jacob and Wainwright (2011)] considered above all perform similarly to the 
Hotelling T 2 -statistic. In the case where edges are erroneously removed, we 
start with a true network having 60 edges instead of 20 in Section 5. Figure 7 
shows that our graph-based approach can still perform much better than 
all competing methods, even in cases where the topology of the observed 
network is very incomplete (1/3 of the true number of edges) or noised by 
a lot of spurious edges. Figure 8 shows examples of networks corrupted by 
removing (top row) or adding (bottom row) edges to the original one and 
used in this experiment. It is visually clear that the information provided 
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Fig. 8. Synthetic data: Examples of corrupted networks used to generate Figure 7. Left 
column: original network used to generate the data of Section 5.2 before removing (top 
row) and adding (bottom row) edges. Middle column: one instance of removing/adding 20 
edges. Right column: one instance of removing/adding 40 edges. 

to our procedure is very different from the one, that is, used to generate 
the data. Again, this is an encouraging result, as it is well known that the 
gene networks available in the literature are missing a lot of interactions and 
often include incorrect information. 

5.3. Branch- and-bound subgraph discovery. To evaluate the performance 
of the subgraph discovery algorithms proposed in Section 4, we generated 
a graph of 100 nodes formed by tightly-connected hubs of sizes sampled 
from a Poisson distribution with parameter 10 and only weak connections 
between these hubs (Figure 9). Such a graph structure is intended to mimic 
the typical topology of gene regulation networks. We randomly selected one 
subgraph of 5 nodes to be nonhomogeneous, with smooth shift in the first 
ko = 3 coefficients. The mean shift was set to zero on the rest of the graph. 
We set the norm of the mean shift to 1 and the covariance matrix to identity, 
so that detecting the shifted subgraph is impossible by just looking at the 
mean shift on the graph. 

We evaluated run-time for full enumeration, the exact branch-and-bound 
algorithm based on Lemma 2 (Section 4.1), and the approximate algorithm 
based on the Euclidean norm (Section 4.2). We also examined run-time on 
data with permuted class labels, as the subgraph discovery procedure is to be 
run on such data to evaluate the number of false positives and adjust for mul- 
tiple testing. Averaging over 20 runs, the full enumeration procedure took 
732 ± 9 seconds per run and the exact branch-and-bound 627 ± 59 seconds 
on the nonpermuted data and 578 ± 100 seconds on permuted data. Over 
100 runs, the approximation at 9 = 0.5 (A m ; n = 0.52) took 204 ± 86 seconds 
(129 ± 91 on permuted data) and the approximation at 6 = 1 (A m ; n = 1.04) 
took 183 ± 106 seconds (40 ± 60 on permuted data). The latter approxima- 
tion missed the nonhomogeneous subgraph in 5% of the runs. 

While neither the exact nor the approximate bounds are efficient enough 
to allow systematic testing on huge graphs for which the full enumeration 
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Fig. 9. Synthetic data: Random graph used in the evaluation of the pruning procedure. 

approach would be impossible, they allow a significant gain in speed, espe- 
cially for permuted data, and will thus prove to be very useful for multiple 
testing adjustment. 

6. Results on cancer gene expression data. We also validated our meth- 
ods using the following two microarray expression data sets: a breast cancer 
data set [Loi et al. (2008)] and a bladder cancer data set [Stransky et al. 
(2006)]. 

Breast cancer data set. The first data set by Loi et al. (2008) comprises 
the expression measures of 15,737 genes for 255 ER+ breast cancer patients 
treated with tamoxifen. Breast tumors are generally classified into three 
main categories [Perou et al. (2000)]: luminal epithelial/ 'ER+, HER2+, and 
triple negative. ER+ tumors typically express estrogen receptors at a high 
level and often rely on estrogen for their growth. Tamoxifen is an antagonist 
of estrogen receptors and therefore prevents its activation by endogenous 
estrogen. Some ER+ tumors, however, keep growing after being treated 
with tamoxifen. An important goal is to detect structured groups of genes 
which are differentially expressed between resistant and sensitive patients, 
as detecting such groups could help understand resistance mechanisms and 
eventually improve ER+ breast tumor treatment. Using distant metastasis- 
free survival as a primary endpoint, 68 patients from this data set are labeled 
as resistant to tamoxifen and 187 are labeled as sensitive to tamoxifen. 
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Bladder cancer data set. The second data set by Stransky et al. (2006) 
consists of the expression measures of 8323 genes for 57 urothelial tumors. 
Urothelial tumors are known to be arising and evolving through two dis- 
tinct pathways, one typically leading to low-grade noninvasive tumors (Ta 
tumors), the other involving more aggressive tumors [Bakkar et al. (2003), 
Knowles (2006)]. These two subtypes, however, are not distinguishable from 
simple markers such as estrogen receptor or HER2 status for breast tumors. 
Mutation of the FGFR3 gene is sometimes used as a proxy, as about 70% 
of the noninvasive tumors carry it. As this information was unfortunately 
not available for this data set, we used the second best proxy which is tu- 
mor stage. We defined two groups: 25 tumors either at the Ta or Tl stages 
(TaTl group) and 32 tumors at the T2, T3 or T4 stages (T2+ group). 
The muscle invasive T2+ tumors are aggressive and present a high risk of 
metastasis, while the Ta tumors have high recurrence level but low chance of 
progression into muscle invasive tumors. Identifying pathways which differ 
in expression between the two subtypes could help understand the disease 
better and improve its treatment. 

6.1. KEGG networks. 

Breast cancer data set. Starting with the breast cancer data set, we tested 
individually 351 connected components from 100 KEGG pathways corre- 
sponding to known gene regulation networks listed in the supplemental ar- 
ticle Supplement B [Jacob, Neuvial and Dudoit (2011b)], using the classical 
Hotelling T 2 -test and the T 2 -test in the new graph-based space retaining 
only the first 20% coefficients (k = 0.2p). This value was the one chosen 
in Rapaport et al. (2007) on the same networks, accordingly with an ar- 
gument based on loss minimization, not on hypothesis testing. The anal- 
ysis of Lopes, Jacob and Wainwright (2011) suggests that the projection 
method (on random subspaces in their case) is quite robust to the choice 
of k. More refined heuristics could be based on eigengaps, that is, on the dis- 
tances between successive eigenvalues. Indeed, matrix perturbation results 
suggest that eigenspaces can vary a lot even under small perturbations of 
the network if the largest discarded eigenvalue is close to the smallest kept 
eigenvalue [Davis and Kahan (1969), Stewart and Sun (1990), Ipsen (2010)]. 
Values of k such that — Xk+i is as large as possible could therefore be 
generally preferable. 

The networks had 36 nodes in average, with a median of 23. For each 
of the 351 graphs, (unadjusted) p-values were computed under the nominal 
F-distributions Fq(j), n\ + ri2 — p — 1) and Fo(k, n\ -\-ri2 — k — 1), respectively. 
The Benjamini and Hochberg (1995) procedure was then applied to control 
the false discovery rate (FDR) at level 0.05. 

Since there is no gold standard regarding which pathways are actually 
involved in endocrine resistance, practical validation of the entire set of 
detected pathways requires advanced biological expertise and further exper- 
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Fig. 10. Breast cancer data set: KEGG prostate cancer pathway. Scaled difference in 
sample mean expression measures between tamoxifen-resistant and sensitive patients, for 
genes in one component of the KEGG prostate cancer pathway. Nodes are colored according 
to the value of the difference in means, with green corresponding to high positive values, 
red to high negative values, and black to 0. Red arrows denote activation, blue arrows 
inhibition. 

iments and is the subject of ongoing collaborations. Nonetheless, inspection 
of our list reveals several pathways which would not have been detected 
(or would have been further down in the list) without accounting for the 
network structure and which have recently been shown to be central in ta- 
moxifen resistance. Many of these pathways involve the Ras/Raf-l/MAPK 
cascade [McGlynn et al. (2009)], like one of the connected components of the 
prostate cancer pathway shown in Figure 10 and one connected component 
of the GnRH pathway shown in Figure 11. The former also involves the 
overexpressed FGFR1, whose amplification was very recently shown to be 
implicated in endocrine therapy resistance by Turner et al. (2010). The lat- 
ter pathway involves overexpressed SRC, which is also a well-studied target 
when trying to prevent tamoxifen resistance [Herynk et al. (2006)]. Both 
pathways have a much smaller p-value when accounting for their graph 
structure than when testing in the original gene space: 10~ 4 vs. 0.02 for 
the prostate cancer pathway and 10" 3 vs. 0.11 for the GnRH signaling 
pathway. This is because the differences in expression of individual genes 
are insufficient to be significant in 36 and 19 dimensions, respectively, while 
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Fig. 11. Breast cancer data set: KEGG GnRH signaling pathway. Scaled difference in 
sample mean expression measures between tamoxifen-resistant and sensitive patients, for 
genes in one component of the KEGG GnRH signaling pathway. Nodes are colored ac- 
cording to the value of the difference in means, with green corresponding to high positive 
values, red to high negative values, and black to 0. Red arrows denote activation, blue 
arrows inhibition. 

the expression shift projected in the first 8 and 4 graph-based directions, 
respectively, is significant. Note that the corresponding p-values for the hy- 
pergeometric enrichment test are 0.15 and 0.31. The complete gene lists of 
the two components are reported in Tables 3 and 4, respectively, in the sup- 
plemental article Supplement C [Jacob, Neuvial and Dudoit (2011c)]. Using 
a system-based approach like our proposed graph-based test therefore allows 
us to recover several known results (which may not have been obvious from 
the same data when looking at each gene individually) and may give insight 
regarding other resistance mechanisms by highlighting connections between 
these results. 

Another example of a network selected only when accounting for graph 
structure is Leukocyte transendothelial migration, shown in Figure 12. To 
the best of our knowledge, this pathway is not specifically known to be in- 
volved in tamoxifen resistance. However, its role in resistance is plausible, 
as leukocyte infiltration was recently found to be involved in breast tumor 
invasion [Man (2010)]; more generally, the immune system and inflamma- 
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Fig. 12. Breast cancer data set: KEGG leukocyte trans endothelial migration pathway. 
Scaled difference in sample mean expression measures between tamoxij 'en-resistant and 
sensitive patients, for genes in one component of the KEGG leukocyte transendothelial 
migration pathway. Nodes are colored according to the value of the difference in means, 
with green corresponding to high positive values, red to high negative values, and black to 0. 
Red arrows denote activation, blue arrows inhibition. 

tory response are closely related to the evolution of cancer. Here again, the 
p- value of the hypergeometric test is extremely high (0.31). The entire list 
of genes in this component is reported in Table 5 in the supplemental arti- 
cle Supplement C [Jacob, Neuvial and Dudoit (2011c)]. 

Bladder cancer data set. Testing the same KEGG networks on the blad- 
der cancer data set, we immediately notice that several gene sets which are 
well known to be specific of one of the two bladder cancer progression path- 
ways have much lower p- values under our graph-based approach than using 
the Hotelling T 2 -statistic. This is the case, in particular, for the p53 signal- 
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Fig. 13. Bladder cancer data set: KEGG p53 signaling pathway. Scaled difference in 
sample mean expression measures between T2+ and TaTl tumors, for genes in one com- 
ponent of the KEGG p53 signaling pathway. Nodes are colored according to the value of the 
difference in means, with green corresponding to high positive values, red to high negative 
values, and black to 0. Red arrows denote activation, blue arrows inhibition. 



ing pathway [Spruck et al. (1994), Sanchez-Carbayo et al. (2006)], which is 
displayed in Figure 13 and for which the graph-based procedure outputs a p- 
value of 8.5 x 10 -6 vs. 0.3 for the classical T 2 -statistic. The TP53 gene itself 
is overexpressed in invasive (T2+) tumors, van Rhijn et al. (2004) suggested 
that FGFR3 and TP53 mutations characterize the two growth pathways and 
are mutually exclusive. A more recent study [Hernandez et al. (2005)] contra- 
dicts the exclusion, but the observed underexpression of TP53 in the invasive 
group could be coherent with its typical mutation in invasive tumors. Genes 
coding for cyclins, such as CCNB1, CCNB2 and CDC2, are overexpressed. 
Cyclins are positively involved in cell proliferation, which is coherent with 
their overexpression in invasive tumors, as it was already observed for other 
genes of the cyclin family [Levidou et al. (2010)]. IGF1 is also overexpressed 
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Fig. 14. Bladder cancer data set: KEGG ErbB signaling pathway. Scaled difference in 
sample mean expression measures between T2+ and TaTl tumors, for genes in one com- 
ponent of the KEGG ErbB signaling pathway. Nodes are colored according to the value 
of the difference in means, with green corresponding to high positive values, red to high 
negative values, and black to 0. Red arrows denote activation, blue arrows inhibition. 



in T2+ tumors, known to induce cell proliferation [Dunn et al. (1997)] and 
was selected as a prognosis predictor for bladder cancer in Mitra et al. (2009). 

We also observe a much lower p-value using our procedure than using the 
classical T 2 -statistic (2.3 x 10~ 5 vs. 0.066) for the ErbB signaling pathway, 
shown in Figure 14 and known to behave differently in the two bladder cancer 
growth pathways [Mellon et al. (1996)]. In particular, the network involves 
the PIK3, RAS and MAPK genes, which are known to be oncogenes specific 
to one of the growth pathways [Eswarakumar, Lax and Schlessinger (2005)]. 

Finally, changes in the TGF-j3 signaling pathway are also known to be 
related to the aggressiveness of bladder cancers [Hung et al. (2008)]. The 
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Fig. 15. Bladder cancer data set: KEGG TGF-/3 signaling pathway. Scaled difference 
in sample mean expression measures between T2+ and TaTl tumors, for genes in one 
component of the KEGG TGF-/3 signaling pathway. Nodes are colored according to the 
value of the difference in means, with green corresponding to high positive values, red to 
high negative values, and black to 0. Red arrows denote activation, blue arrows inhibition. 

network is shown in Figure 15 and here again our procedure results in a much 
lower p- value than the Hotelling test (2.6 x 10~ 6 vs. 0.049). 

Unsurprisingly, these three networks have a relatively large size with re- 
spect to the low sample size of this data set and several of their genes show 
only very moderate differential expression when tested individually. 

6.2. NCI networks. We also tested 75 connected components coming 
from gene networks of the NCI Pathway IntegrationDatabase. 2 The NCI net- 
works considered are listed in the supplemental article Supplement B [Jacob, 
Neuvial and Dudoit (2011b)]. Unlike KEGG pathways for which the Biocon- 
ductor R package KEGGgraph had already been developed, NCI pathways 
were not readily available as R objects. We therefore developed NCIgraph 



2 http : //pid.nci .nih.gov. 
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[Jacob (2011)], a Bioconductor R package which converts pathways available 
in BioPAX format to R objects. In addition, instead of importing a gene net- 
work as is into R, we provide an option to convert as well as possible the 
original network, whose nodes can represent proteins, protein complexes or 
concepts like transport or biochemical reactions, into one whose nodes cor- 
respond to genes and whose edges represent direct or indirect interactions 
at the expression level. For instance, if protein A is known to activate pro- 
tein B, which is a transcription factor for gene C, a relevant network in terms 
of gene expression should be A and B pointing to C, whereas the BioPAX 
network will most likely be represented as A pointing to B pointing to C. 
As discussed in Section 5.2, our method is robust to irrelevant edges in the 
graph. Such a transformation is nonetheless important, since the method 
essentially uses biological networks as a prior on the covariance structure 
of gene expression. After this transformation, however, most networks have 
much simpler topologies, typically with all genes pointing to one or a few 
targets. As a result, Laplacian eigenvalues often have high multiplicities, 
which makes the effect of filtering less drastic. 3 In addition, the networks we 
consider here have much smaller size than the KEGG networks on average 
(8.9 vs. 36 for means, 7.5 vs. 23 for medians), which also explain the milder 
difference between results before and after dimensionality reduction. 

For the breast cancer data, the 75 connected components we consider are 
those which have a nonempty intersection with the genes in this microarray 
data set. As for the KEGG networks, we compare the classical Hotelling 
T 2 -test and the T 2 -test in the new graph-based space retaining only the 
first 20% coefficients (fc = 0.2p). 

As an example, NFkB activation by Nontypeable Hemophilus influenzae 
shown in Figure 16 includes 21 genes from the breast cancer data set, but 
keeping the first 20% of the eigenvalues amounts to keeping 16 dimensions 
because of multiplicities. As a consequence, the p- value obtained after filter- 
ing is only slightly lower than that before filtering. Here again, the original 
context of study for this pathway has nothing to do with breast cancer: the 
purpose was to uncover the inflammation and mucin overproduction mech- 
anism caused by a particular bacteria. Nevertheless, this network contains 
several genes which are either known actors of endocrine resistance or whose 
activity can be directly linked to the resistance phenomenon. Moreover, as 
one may expect, most of the observed gene- wise differential expression is co- 
herent with the annotated interactions. On the lower part of the figure, IL1B 
is shown to be overexpressed in sensitive patients. Consistent with this fact, 
its negative regulator p38 (MAPK11 and MAPK14) is downregulated in sen- 
sitive patients and its positive regulator CREBBP is upregulated. Note that 
DUSP1 was incorrectly annotated as a negative regulator in the automatic 



If the eigenvalue has a very high multiplicity, for example, then even the most 
extreme filtering still retains a large number of dimensions. 
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Fig. 16. Breast cancer data set: NCI Nfltb activation by nontypeable hemophilus influen- 
zae pathway. Scaled difference in sample mean expression measures between tamoxifen-re- 
sistant and sensitive patients, for genes in one component of the NCI imported BioCarta 
Nfkb activation by nontypeable hemophilus influenzae pathway. Nodes are colored according 
to the value of the difference in means, with green corresponding to high positive values, 
red to high negative values, and black to 0. Red arrows denote activation, blue arrows 
inhibition. 

network conversion process of NCIgraph but is actually a positive regulator, 
as it is involved in the inactivation of p38. NR3C1 is involved in the transcrip- 
tion of DUSP1 and is also upregulated in sensitive patients. A few inconsis- 
tencies can be observed, like MAP2K3 and MAP2K6 which are negative reg- 
ulators of IL1B, yet are overexpressed in sensitive patients. Recall, however, 
that the criterion we use for coherence is based on the difference between 
the expression of each gene and the (interaction-sign corrected) average ex- 
pression of its regulators. The second main output of the pathway, MUC2, is 
downregulated in sensitive patients, which makes sense both in terms of the 
expression of its negative regulator TGFBR2, which is upregulated, and the 
already observed fact [Srinivasan et al. (2007)] that the estrogen receptor 
upregulates MUC2 and that tamoxifen could block its expression. 
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The role of MUC2 in resistance to tamoxifen treatment of ductal carci- 
noma does not seem to be clearly established. Overexpression of MUC2 is 
sometimes found to be mildly correlated with good prognosis [Walsh et al. 
(1993), Rakha et al. (2005)], but this may be caused by its correlation with 
ER+ status. Its overexpression in resistant patients observed in this data 
set may well be noncausal, but would deserve further investigation. As for 
TGFBR2, inactivating mutations of the gene have been reported to be asso- 
ciated with recurrence and tamoxifen resistance [Liicke et al. (2001)], which 
is coherent with underexpression in resistant patients. Regarding IL1B, the 
main output of the pathway, its overexpression has been shown to be related 
to inhibition of cancer growth through apoptosis [Roy, Sarkar and Felty 
(2006)]. DUSP1 is a known negative regulator of cell proliferation and over- 
expression of p38 is known to be related to tamoxifen resistance [Gutierrez 
et al. (2005)]. Interestingly, NR3C1 activity has also been described by Wu 
et al. (2005) as being related to breast cancer cell survival through its induc- 
tion of MAPK1 expression, which illustrates the interest of studying differen- 
tial expression patterns at a system level rather than at the single-gene level. 

It is also important to note that at least two interpretations can be given 
for the fact that sensitive patients have several gene expression patterns 
corresponding to known factors of good prognosis. Some of these patterns 
may be caused by the treatment, in which case understanding how tamoxifen 
affects these genes in some patients and not in others may be a proxy to 
understanding resistance mechanisms. Some of the patterns though may also 
have been caused by phenotypic traits of the sensitive patients, leading to 
better prognosis but without any link to the treatment. 

Another small but relevant example is the sonic hedgehog receptor ptcl 
regulates cell cycle pathway shown in Figure 17, which is entirely overex- 
pressed in resistant patients, yielding a 10-fold change between the p-value 
with and without dimensionality reduction. The genes in this pathway are 
known to be related to tamoxifen resistance: CCNB1 is related to prolif- 
eration and is part of several existing tamoxifen-resistance signatures [Paik 
et al. (2004)] and inhibition of CDC2 was already proposed as an alternative 
treatment for endocrine resistant tumors [Johnson et al. (2010)]. 

6.3. Branch- and-bound subgraph discovery. We ran our branch-and-bound 
nonhomogeneous subgraph discovery procedure on the cell cycle pathway, 
whose largest connected component, after restriction to edges of known sign 
(inhibition or activation), has 86 nodes and 442 edges. Specifically, we sought 
to detect differentially expressed subgraphs of size q = 5, after preselecting 
those for which the squared Euclidean norm of the empirical shift exceeds 
8 = 0.1; for a test in the first k = 3 components at level a = 10 -4 , this corre- 
sponds to A m ; n < 0.23 and to an expected removal of 95% of the subgraphs 
under the approximation that the squared Euclidean norm of the subgraphs 
follows a x|-distribution. 
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Fig. 17. Breast cancer data set: NCI sonic hedgehog receptor ptcl regulates cell cycle 
pathway. Scaled difference in sample mean expression measures between tamoxij en-resis- 
tant and sensitive patients, for genes in one component of the NCI imported BioCarta 
sonic hedgehog receptor ptcl regulates cell cycle pathway. Nodes are colored according to 
the value of the difference in means, with green corresponding to high positive values, red to 
high negative values, and black to 0. Red arrows denote activation, blue arrows inhibition. 

For a = lCT 4 , out of 100 runs on permuted data, only 9 rejected the null 
hypothesis for at least one subgraph. More precisely, 4 of these 9 runs de- 
tected 1 subgraph and the others detected 3, 6, 6, 21 and 26 subgraphs. In 
contrast, 41 overlapping subgraphs (Figure 18) were detected on the orig- 
inal data, corresponding to a connected subnetwork of 25 genes. Some of 
the genes belonging to these networks exhibit large individual differential 
expression, namely, TP53 whose mutation has been long known to be in- 
volved in tamoxifen resistance [Andersson et al. (2005), Fernandez-Cuesta 
et al. (2010)]. Accordingly, its negative regulator MDM2 is overexpressed 
and its positive regulator CREBBP is underexpressed. E2F1, whose ex- 
pression level was recently shown to be involved in tamoxifen resistance 
[Louie et al. (2010)], is also part of the identified network, as well as CCND1 
[Barnes (1997), Musgrove and Sutherland (2009)]. Some other genes in the 
network have quite low t-statistics and would not have been detected indi- 
vidually. This is the case of CCNE1 and CDK2, which were also described 
in Louie et al. (2010) as part of the same mechanism as E2F1. Similarly, 
CDKN1A was recently found to be involved in anti-estrogen treatment re- 
sistance [Musgrove and Sutherland (2009)] and in ovarian cancer, which is 
also a hormone-dependent cancer [Cunningham et al. (2009)]. Interestingly, 
RBX1, a gene coding for a RING-domain E3 ligase known to be involved in 
degradation of estrogen receptor a (ERa) [Ohtake et al. (2007)], appears to 
be overexpressed in resistant patients. This fact may suggest that some of 
the resistant ER+ patients had fewer receptors and, as a result, their tumors 
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Fig. 18. Breast cancer data set: Subgraph discovery. Difference in sample mean expres- 
sion measures between tamoxif en-resistant and sensitive patients, for genes in the two 
overlapping subgraphs detected at a = 1CP 4 . Nodes are colored according to the value of 
the difference in means, with green corresponding to high positive values, red to high neg- 
ative values, and black to 0. Red arrows denote activation, blue arrows inhibition. 

were relying less on estrogen for their growth, hence, the limited effect of a se- 
lective estrogen receptor modulator (SERM) like tamoxifen. The networks 
also contain CDK4, whose inhibition was described in Sutherland and Mus- 
grove (2009) as acting synergistically with tamoxifen or trastuzumab. More 
generally, a large part of the network displayed in Figure 2A of Musgrove 
and Sutherland (2009) is included in our network, along with other known 
actors of tamoxifen resistance. Admittedly, selecting an important regula- 
tor like TP53 is not a surprising result, but our system-based approach to 
pathway discovery directly identifies an important set of interacting genes 
and may therefore prove to be more efficient than iterative individual iden- 
tification of single actors. 
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7. Software implementation. The graph-structured test of Section 3 is 
implemented in the R software package DEGraph, released through the Bio- 
conductor Project (release 2.7). Instructions for download and installation 
are available at http://www.bioconductor.org. Note that implementa- 
tions of the branch-and-bound algorithms are not yet included in this pack- 
age, but are available upon request. 

As mentioned in Section 6.2, we also developed NCIgraph [Jacob (2011)], 
a Bioconductor R package which converts pathways available in BioPAX 
format to R objects with various preprocessing options. 

8. Discussion. We developed a graph-structured two-sample test of means, 
for problems in which the distribution shift is assumed to be smooth on 
a given graph. We proved quantitative results on power gains for such 
smooth-shift alternatives and devised branch-and-bound algorithms to sys- 
tematically apply our test to all the subgraphs of a large graph, without 
enumerating and testing these subgraphs one-by-one. The first algorithm 
is exact and reduces the number of explicitly tested subgraphs. The sec- 
ond one is approximate, with no false positives and a quantitative result 
on the type of false negatives (with respect to the exact algorithm). The 
nonhomogeneous subgraph discovery method involves performing a large 
number of tests, with highly-dependent test statistics. However, as the ac- 
tual number of tested hypotheses is unknown, standard multiple testing 
procedures are not directly applicable. Instead, we use a permutation proce- 
dure to estimate the distribution of the number of false positive subgraphs. 
Such resampling procedures (bootstrap or permutation) are feasible due to 
the manageable run-time of the pruning algorithms of Section 4. Results on 
synthetic data illustrate the good power properties of our graph-structured 
test under smooth-shift alternatives, as well as the good performance of our 
branch-and-bound-like algorithms for subgraph discovery. Very promising 
results are also obtained on the gene expression data sets of Loi et al. (2008) 
and Stransky et al. (2006). 

Future work should investigate the use of other bases, such as graph- 
wavelets [Hammond, Vandergheynst and Gribonval (2009)], which would 
allow the detection of shifts with spatially-located nonsmoothness, for ex- 
ample, to take into account errors in existing networks. As for the cutoff 
selection, more systematic procedures should be considered, for example, 
the two-step method proposed in Das Gupta and Permian (1974), adaptive 
approaches as in Fan and Lin (1998) or heuristics based on the eigengap as 
mentioned in Section 6. The pruning algorithm would naturally benefit from 
sharper bounds. Such bounds could be obtained by controlling the condition 
number of all covariance matrices, using, for example, regularized statistics 
which still have known nonasymptotic distributions, such as those of Tai and 
Speed (2009). Concerning multiple testing, procedures should be devised to 
exploit the dependence structure between the tested subgraphs and to deal 
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with the unknown number of tests. The proposed approach could also be en- 
riched to take into account different types of data, for example, copy number 
for the detection of DE gene pathways. More subtle notions of smoothness, 
for example, "and" (resp., "or") logical relations [Vaske et al. (2010)], could 
also be included to represent regulation mechanisms where the simultaneous 
presence of two transcription factors (resp., the presence of one or the other) 
is necessary to activate the transcription of another gene. Other applications 
of two-sample tests with smooth-shift on a graph include fMRI and eQTL 
association studies. For fMRI data, the goal would be to detect whether the 
brain activity changes between two conditions, using the prior information 
that parts of the brain which are close up to brain convolutions or known 
connection patterns should exhibit the same kind of change. One could also 
want to identify specific areas of the brain whose activity changes between 
two conditions. In eQTL studies, people are often interested in finding genes 
whose expression is influenced by single-nucleotide polymorphisms (SNPs), 
resulting in a large number of individual tests which often need to be ag- 
gregated a posteriori at the pathway level. Our method could be used to 
identify pathways whose expression is associated with particular SNPs. 

Finally, it would be of interest to compare our testing approach with 
structured sparse learning (which we briefly described in Section 1) for the 
purpose of identifying expression signatures that are predictive of drug re- 
sistance. Methods should be compared in terms of prediction accuracy and 
stability of the selected genes across different data sets, a central and diffi- 
cult problem in the design of such signatures [Ein-Dor et al. (2005), He and 
Yu (2010), Haury, Jacob and Vert (2010), Haury, Gestraud and Vert (2011)]. 
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SUPPLEMENTARY MATERIAL 

Supplement A: Technical results and proofs 

(DOI: 10. 1214/1 1-AOAS528SUPPA; .pdf). This section contains our techni- 
cal results (Lemma and Corollaries) on gain in power along with their proofs. 
It also contains the upper bound used in the branch and bound algorithm 
with its proof. Finally, it contains the lemma characterizing the subgraphs 
that would be missed by the approximated subgraph discovery algorithm 
presented in Section 4.2 along with its proof. 

Supplement B: Pathways considered in the experiments 

(DOI: 10.1214/11-AOAS528SUPPB; .pdf). This section lists the names of 
the pathways considered in the experiments. 
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Supplement C: Gene lists (DOI: 10.1214/11-AOAS528SUPPC; .pdf). This 
section lists the genes belonging to each of the pathways studied in detail in 
the experiments along with their i-statistic and corresponding p-value. 
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